downloaded and used for research or for learning. Most RNA-Seq raw sequence
data are in FASTQ format. When analyzing RNA-Seq data, we must pay attention
to the design of the study. For instance, if the goal is differential gene expression
analysis, there must be control raw data to compare against. The choice of control
is determined by the research goal. In conditions such as cancer, researchers may use
raw sequencing data from healthy tissue as a control for the raw data from the affected
tissue, with both samples taken from the same individual. However, researchers may also
intend to compare gene expression across individuals or samples. Most researchers include
replicate samples in the design of their study, so there will be multiple raw data files
for a single sample. Replicates reduce the effect of errors introduced by the laboratory
technique used and of errors introduced during the sequencing steps.
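Because most raw data arrive as FASTQ files, it can help to inspect a file before any analysis. The following is a minimal sketch in Python, assuming the common four-line FASTQ record layout (a header line starting with "@", the sequence, a separator line starting with "+", and a quality string of the same length as the sequence). The function names read_fastq and summarize_fastq are illustrative and do not come from any particular package.

```python
import gzip
from collections import Counter
from typing import Dict, Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) for each four-line FASTQ record."""
    # Plain-text and gzip-compressed files are both handled (by file suffix).
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\r\n")
            if not header:
                break  # end of file (a blank line also stops the loop)
            sequence = handle.readline().rstrip("\r\n")
            handle.readline()  # separator line starting with '+', ignored
            quality = handle.readline().rstrip("\r\n")
            if not header.startswith("@") or len(sequence) != len(quality):
                raise ValueError(f"Malformed FASTQ record near {header!r}")
            yield header[1:], sequence, quality


def summarize_fastq(path: str) -> Tuple[int, Dict[int, int]]:
    """Return the number of reads and a {read_length: count} dictionary."""
    n_reads = 0
    lengths: Counter = Counter()
    for _read_id, sequence, _quality in read_fastq(path):
        n_reads += 1
        lengths[len(sequence)] += 1
    return n_reads, dict(lengths)
```

Calling summarize_fastq on each replicate file gives a quick way to confirm that the files are readable and that the read lengths match what the study describes.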
For practice, we will use RNA-Seq raw data from a breast cancer study of differential
gene expression in tumor cells. The data are in six FASTQ files (three replicates for tumor
and three replicates for normal) containing paired-end reads of 151 bases. For the
sake of simplicity, the files include only the RNA-Seq reads of chromosome 22. The data
were simplified in this way so that processing does not take too much time. To
keep the files organized, create a main directory “rnaseq” as the project directory,
create the subdirectory “fastq” inside it, and then download the raw data from
“https://github.com/hamiddi/ngs” into this subdirectory. To avoid repetition, assume that
adapters, duplicates, and low-quality reads have already been removed from the raw data files.
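The project layout described above can be sketched in Python as follows. This is only an illustration: the helper names make_project_dirs and list_fastq_files are hypothetical, the file extensions searched for are assumptions about how the FASTQ files are named, and the download itself is left as a manual step because the layout of the repository is not described here.

```python
from pathlib import Path
from typing import List


def make_project_dirs(root: str = "rnaseq") -> Path:
    """Create <root>/fastq (and <root> itself if needed); return the fastq path."""
    fastq_dir = Path(root) / "fastq"
    # parents=True creates the project directory as well;
    # exist_ok=True makes the call safe to repeat.
    fastq_dir.mkdir(parents=True, exist_ok=True)
    return fastq_dir


def list_fastq_files(fastq_dir: Path) -> List[Path]:
    """List FASTQ files in fastq_dir, sorted by name.

    The extensions below are assumed naming conventions, not taken
    from the data set itself.
    """
    patterns = ("*.fastq", "*.fq", "*.fastq.gz", "*.fq.gz")
    return sorted(p for pattern in patterns for p in fastq_dir.glob(pattern))
```

A typical use would be to call make_project_dirs() once, place the downloaded files in the returned directory, and then call list_fastq_files on it to confirm that all the expected files are present.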
5.3.2 Read Mapping
Read mapping follows the preprocessing and cleaning of the raw data. The accuracy of the
analyses depends heavily on the read mapping. The mapping, as discussed in Chapter 2, is the
FIGURE 5.1 RNA-seq data analysis workflow.